[LEADS-348] evaluation latency by xmican10 · Pull Request #235 · lightspeed-core/lightspeed-evaluation

xmican10 · 2026-05-14T13:55:29Z

Description

Depends on #230
Further changes for LEADS-348:

result.execution_time was renamed to result.evaluation_latency which represents better that it is measuring the evaluation
rounding of result.evaluation_latency was removed for consistency, since other timing metrics are not being rounded
result.execution_time was reintroduced (for backward compatibility), but currently measures the whole api execution and evaluation per turn
test_evaluation.py was refactored - redundant mocking was moved to pytest fixture (~500 lines reduced)

Type of change

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

Assisted-by: Claude (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

Related Issue #
Closes #

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

New Features
- Added a separate evaluation latency metric and now reports execution time as the sum of evaluation + agent latencies; the new field appears in CSV exports and JSON outputs.
Tests
- Updated and expanded tests and fixtures to validate evaluation latency capture, execution-time computation, and CSV/DB output consistency.

coderabbitai · 2026-05-14T13:55:35Z

Walkthrough

This PR refactors evaluation result timing from a single execution_time field to a decomposed model where evaluation_latency and agent_latency are measured separately and execution_time equals their sum. The change spans data model, evaluator logic, storage, output, and comprehensive test updates.

Changes

Timing Model Refactoring

Layer / File(s)	Summary
Schema and configuration updates `config/system.yaml`, `src/lightspeed_evaluation/core/constants.py`, `src/lightspeed_evaluation/core/models/data.py`	Configuration and data model fields updated to support `evaluation_latency` as a CSV column and field; timing fields now use float defaults with clarified descriptions.
Evaluator timing measurement and computation `src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`	Added `_measure_latency()` using `time.perf_counter()`. `evaluate_metric()` captures `start_time`, computes `evaluation_latency`, records `agent_latency`, and sets `execution_time = evaluation_latency + agent_latency`. Error paths now pass `start_time` to `_create_error_result()`.
Storage and output serialization `src/lightspeed_evaluation/core/storage/sql_storage.py`, `src/lightspeed_evaluation/core/output/generator.py`	ORM model adds nullable `evaluation_latency` column and mapper persists it. JSON/CSV output generation now includes `evaluation_latency`; per-column formatting for `execution_time` was simplified.
Test fixtures and system config test `tests/script/conftest.py`, `tests/unit/pipeline/evaluation/conftest.py`, `tests/unit/core/system/test_loader.py`	Script fixtures updated to use `evaluation_latency`. New `evaluator` pytest fixture wires a MetricsEvaluator with mocked dependencies. Added test to load and validate default `config/system.yaml`.
Data model, storage and output tests `tests/unit/core/models/test_data.py`, `tests/unit/core/models/test_summary.py`, `tests/unit/core/output/test_generator.py`, `tests/unit/core/storage/test_sql_storage.py`	Unit tests updated to assert defaults/validation for `evaluation_latency`, test fixtures adjusted, and storage/output tests check `evaluation_latency` persistence and CSV/DB column presence.
Evaluator behavior and error tests `tests/unit/pipeline/evaluation/test_evaluator.py`, `tests/unit/pipeline/evaluation/test_errors.py`	Evaluator test suite refactored to use injected `evaluator` fixture; added/updated assertions for `evaluation_latency`, `agent_latency`, `execution_time` computation, error-result field preservation, expected-response handling, and token accounting (no double counting) across multi-path and error scenarios.

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

lightspeed-core/lightspeed-evaluation#230: Overlapping work wiring/measuring agent_latency into EvaluationResult and related reporting/storage.

Suggested reviewers

asamal4
VladimirKadlec

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title '[LEADS-348] evaluation latency' clearly describes the main objective: introducing evaluation latency tracking and measurement. It directly matches the primary changes across all modified files.
Docstring Coverage	✅ Passed	Docstring coverage is 96.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/core/output/generator.py`:
- Line 817: The JSON summary currently adds "evaluation_latency":
r.evaluation_latency but removed the older "execution_time" field; restore
compatibility by including "execution_time" in the emitted dict (e.g., set
"execution_time": r.execution_time or, if r has no execution_time, set it to the
same value as r.evaluation_latency) alongside "evaluation_latency" in the output
constructed where the dict containing "evaluation_latency" is created so older
consumers still receive execution_time.

In `@src/lightspeed_evaluation/core/storage/sql_storage.py`:
- Line 60: The SQL schema and persistence code currently define/persist only
evaluation_latency, dropping the previously stored execution_time; add back
execution_time = Column(Float, nullable=True) alongside evaluation_latency in
the model definition and update the persistence logic that currently writes
evaluation_latency (e.g., in the function that inserts/updates evaluation
records—look for code referencing evaluation_latency around the save/insert
method) to also set execution_time from the same source value so existing
consumers continue receiving execution_time.

In `@tests/unit/pipeline/evaluation/test_errors.py`:
- Line 160: The test currently asserts results[0].evaluation_latency but misses
verifying the backward-compatible execution_time field on error results; update
the test in tests/unit/pipeline/evaluation/test_errors.py to also assert that
results[0].execution_time equals 0.0 alongside the existing evaluation_latency
assertion so error results explicitly include the legacy execution_time field.

In `@tests/unit/pipeline/evaluation/test_evaluator.py`:
- Around line 583-605: The test
test_execution_time_conversation_level_averages_agent_latency incorrectly
asserts result.agent_latency == 4.0; update the assertion to expect the average
agent latency of the conversation (2.0) since TurnData instances have
agent_latency 1.0 and 3.0 and EvaluationRequest.for_conversation should compute
the mean; change the assertion referencing result.agent_latency in this test to
assert result.agent_latency == 2.0 and keep the other
execution_time/evaluation_latency checks unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 56132854-62fe-4236-8dc5-3380d17e64a8

📥 Commits

Reviewing files that changed from the base of the PR and between 377a66b and e9c4b17.

📒 Files selected for processing (15)

config/system.yaml
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/storage/sql_storage.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/script/conftest.py
tests/unit/core/models/test_data.py
tests/unit/core/models/test_summary.py
tests/unit/core/output/test_generator.py
tests/unit/core/storage/test_sql_storage.py
tests/unit/core/system/test_loader.py
tests/unit/pipeline/evaluation/conftest.py
tests/unit/pipeline/evaluation/test_errors.py
tests/unit/pipeline/evaluation/test_evaluator.py

asamal4

Thanks. A minor typo !!

VladimirKadlec

LGTM 👍

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/unit/core/storage/test_sql_storage.py (1)
269-314: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add assertion for execution_time in comprehensive storage test.

The test test_all_fields_stored verifies that all EvaluationResult fields are persisted correctly. It sets evaluation_latency=2.5 (line 283) and asserts row.evaluation_latency == 2.5 (line 312), but doesn't verify execution_time is stored correctly.

Per lines 417-428, both evaluation_latency and execution_time are DB columns. This test should assert row.execution_time to ensure backward compatibility and verify the relationship between timing fields is preserved through storage.
🔍 Suggested assertion to add
         assert row is not None
         assert row.conversation_group_id == "conv_full"
         assert row.score == 0.92
         assert row.evaluation_latency == 2.5
+        assert row.execution_time is not None  # Verify backward compat field is stored
         assert row.tool_calls == '[{"name": "search"}]'
If execution_time should equal evaluation_latency + agent_latency, verify that relationship:
# Assuming agent_latency defaults to 0 when not set
assert row.execution_time == 2.5  # or appropriate computed value
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/storage/test_sql_storage.py` around lines 269 - 314, The
test_all_fields_stored test creates an EvaluationResult with
evaluation_latency=2.5 but never asserts the execution_time DB column; update
the test to read row.execution_time and assert it equals the expected value
(likely 2.5 if agent latency defaults to 0 or evaluation_latency + agent_latency
if agent_latency is present) after saving via
SQLStorageBackend.initialize/save_result; reference the EvaluationResult
instance, the evaluation_latency field, and the execution_time DB column to
ensure the stored execution_time is validated.

🧹 Nitpick comments (3)

tests/unit/core/models/test_summary.py (1)
29-29: ⚡ Quick win

Consider including execution_time in test fixture for completeness.

The fixture now sets evaluation_latency: 1.0 but doesn't set execution_time. Per the PR summary, execution_time is retained for backward compatibility and should equal evaluation_latency + agent_latency. Consider explicitly setting execution_time in the fixture to create realistic test data and verify the field is properly handled throughout the summary generation logic.
♻️ Suggested addition to fixture
 _RESULT_DEFAULTS: dict[str, Any] = {
     "conversation_group_id": "conv1",
     "tag": "eval",
     "turn_id": "turn1",
     "metric_identifier": "ragas:faithfulness",
     "result": "PASS",
     "score": 0.85,
     "threshold": 0.7,
     "reason": "Good",
     "evaluation_latency": 1.0,
+    "agent_latency": 0.5,
+    "execution_time": 1.5,
     "judge_llm_input_tokens": 100,
     "judge_llm_output_tokens": 50,
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/models/test_summary.py` at line 29, The test fixture in
tests/unit/core/models/test_summary.py sets "evaluation_latency" but omits
"execution_time"; update the fixture to include "execution_time" and set it
equal to evaluation_latency + agent_latency (so execution_time =
evaluation_latency + agent_latency) to match the PR behavior and ensure summary
generation and backward-compat checks (e.g., in any assertions referencing
execution_time) use realistic data.
tests/unit/core/storage/test_sql_storage.py (1)
53-53: ⚡ Quick win

Consider setting execution_time in test fixture for backward compatibility verification.

The fixture sets evaluation_latency=1.5 but doesn't set execution_time or agent_latency. Per the PR summary and line 428 of this file, execution_time is retained as a DB column for backward compatibility. The test fixture should either:

Explicitly set execution_time to verify it's stored correctly, or

Verify the model auto-computes execution_time = evaluation_latency + agent_latency

Without this, tests may not catch issues where execution_time is NULL or incorrectly computed.
♻️ Suggested fixture enhancement
     return EvaluationResult(
         conversation_group_id="conv_001",
         tag="test",
         turn_id="turn_1",
         metric_identifier="ragas:faithfulness",
         metric_metadata='{"threshold": 0.8}',
         result="PASS",
         score=0.95,
         threshold=0.8,
         reason="Good response",
         query="What is Python?",
         response="Python is a programming language.",
         evaluation_latency=1.5,
+        agent_latency=0.0,
+        execution_time=1.5,
         api_input_tokens=100,
         api_output_tokens=50,
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/storage/test_sql_storage.py` at line 53, The test fixture in
tests/unit/core/storage/test_sql_storage.py currently sets
evaluation_latency=1.5 but omits execution_time and agent_latency; update the
fixture (the evaluation row/setup used in the tests) to either explicitly set
execution_time (e.g., execution_time=<expected value>) to verify DB persistence,
or set agent_latency and then add an assertion in the relevant test that
execution_time == evaluation_latency + agent_latency so the code path that
computes/stores execution_time is exercised; refer to the fields execution_time,
evaluation_latency, and agent_latency when making the change.
tests/unit/core/models/test_data.py (1)
472-472: ⚡ Quick win

Add an execution_time default assertion to the test
EvaluationResult.execution_time is defined with default=0.0 alongside evaluation_latency=0.0; extend tests/unit/core/models/test_data.py (line 472) to assert result.execution_time == 0.0 in addition to result.evaluation_latency == 0.0.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/models/test_data.py` at line 472, The test currently asserts
result.evaluation_latency == 0 but misses the sibling default; add an assertion
for EvaluationResult.execution_time by verifying result.execution_time == 0.0
immediately alongside the existing evaluation_latency check (i.e., update the
test that references result to include result.execution_time == 0.0).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tests/unit/core/storage/test_sql_storage.py`:
- Around line 269-314: The test_all_fields_stored test creates an
EvaluationResult with evaluation_latency=2.5 but never asserts the
execution_time DB column; update the test to read row.execution_time and assert
it equals the expected value (likely 2.5 if agent latency defaults to 0 or
evaluation_latency + agent_latency if agent_latency is present) after saving via
SQLStorageBackend.initialize/save_result; reference the EvaluationResult
instance, the evaluation_latency field, and the execution_time DB column to
ensure the stored execution_time is validated.

---

Nitpick comments:
In `@tests/unit/core/models/test_data.py`:
- Line 472: The test currently asserts result.evaluation_latency == 0 but misses
the sibling default; add an assertion for EvaluationResult.execution_time by
verifying result.execution_time == 0.0 immediately alongside the existing
evaluation_latency check (i.e., update the test that references result to
include result.execution_time == 0.0).

In `@tests/unit/core/models/test_summary.py`:
- Line 29: The test fixture in tests/unit/core/models/test_summary.py sets
"evaluation_latency" but omits "execution_time"; update the fixture to include
"execution_time" and set it equal to evaluation_latency + agent_latency (so
execution_time = evaluation_latency + agent_latency) to match the PR behavior
and ensure summary generation and backward-compat checks (e.g., in any
assertions referencing execution_time) use realistic data.

In `@tests/unit/core/storage/test_sql_storage.py`:
- Line 53: The test fixture in tests/unit/core/storage/test_sql_storage.py
currently sets evaluation_latency=1.5 but omits execution_time and
agent_latency; update the fixture (the evaluation row/setup used in the tests)
to either explicitly set execution_time (e.g., execution_time=<expected value>)
to verify DB persistence, or set agent_latency and then add an assertion in the
relevant test that execution_time == evaluation_latency + agent_latency so the
code path that computes/stores execution_time is exercised; refer to the fields
execution_time, evaluation_latency, and agent_latency when making the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: be024615-6acf-4965-a895-19197c932bc2

📥 Commits

Reviewing files that changed from the base of the PR and between e9c4b17 and 6d2fd1d.

📒 Files selected for processing (15)

config/system.yaml
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/storage/sql_storage.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/script/conftest.py
tests/unit/core/models/test_data.py
tests/unit/core/models/test_summary.py
tests/unit/core/output/test_generator.py
tests/unit/core/storage/test_sql_storage.py
tests/unit/core/system/test_loader.py
tests/unit/pipeline/evaluation/conftest.py
tests/unit/pipeline/evaluation/test_errors.py
tests/unit/pipeline/evaluation/test_evaluator.py

✅ Files skipped from review due to trivial changes (1)

src/lightspeed_evaluation/core/constants.py

xmican10 force-pushed the LEADS-348-evaluation-latency branch from 3eed19a to 14e437e Compare May 15, 2026 08:14

xmican10 changed the title ~~Leads 348 evaluation latency~~ [LEADS-348] evaluation latency May 15, 2026

xmican10 force-pushed the LEADS-348-evaluation-latency branch 2 times, most recently from 8eb287c to e9c4b17 Compare May 20, 2026 07:42

xmican10 marked this pull request as ready for review May 20, 2026 07:44

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread src/lightspeed_evaluation/core/output/generator.py

Comment thread src/lightspeed_evaluation/core/storage/sql_storage.py

Comment thread tests/unit/pipeline/evaluation/test_errors.py

Comment thread tests/unit/pipeline/evaluation/test_evaluator.py Outdated

xmican10 force-pushed the LEADS-348-evaluation-latency branch from e9c4b17 to 457af2f Compare May 20, 2026 08:12

asamal4 reviewed May 25, 2026

View reviewed changes

Comment thread tests/unit/pipeline/evaluation/test_evaluator.py Outdated

xmican10 force-pushed the LEADS-348-evaluation-latency branch from 457af2f to 39af396 Compare May 25, 2026 11:13

VladimirKadlec approved these changes May 25, 2026

View reviewed changes

Measuring execution time and evaluation latency

6d2fd1d

xmican10 force-pushed the LEADS-348-evaluation-latency branch from 39af396 to 6d2fd1d Compare May 26, 2026 07:57

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

asamal4 merged commit 59240a0 into lightspeed-core:main May 26, 2026
16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[LEADS-348] evaluation latency#235

[LEADS-348] evaluation latency#235
asamal4 merged 1 commit into
lightspeed-core:mainfrom
xmican10:LEADS-348-evaluation-latency

xmican10 commented May 14, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 14, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

asamal4 left a comment

Uh oh!

Uh oh!

VladimirKadlec left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

xmican10 commented May 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Tools used to create PR

Related Tickets & Documents

Checklist before requesting a review

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

asamal4 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

VladimirKadlec left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xmican10 commented May 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 14, 2026 •

edited

Loading